Objectives

  • Demonstrate lubridate for working with dates in R and ggplot2
  • Define correlation
  • Review how to implement smoothing lines
  • Generate and interpret scatterplot matrices and correlation heatmaps
  • Introduce and generate parallel coordinate plots
  • Identify methods for implementing three-dimensional graphs in R
library(tidyverse)
library(ggthemes)
library(knitr)
library(broom)
library(stringr)
library(modelr)

options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

Working with dates in R

For more details on the lubridate package, check out R for Data Science.

When importing data, date variables can be tricky to store and use correctly. In a spreadsheet or other tabular format, dates typically appear either as numeric (20170418) or string (2016-04-18, April 18th, 2017, etc.) columns. To perform tasks such as extracting and summarizing over individual components (year, month, day, etc.), we need to represent dates in a standardized format.

lubridate is a tidyverse package that facilitates working with dates (and date-times) in R.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date

Formatting dates

When using readr to import data files, R uses parse_date() or parse_datetime() to try to format any columns it thinks contain dates or date-times. To format dates from strings manually, use the function that combines y, m, and d in the order matching the original format of the date:

ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"

Extracting date components

Let’s practice extracting components of dates using an example dataset. flights-departed.csv is a time series data file containing the daily number of departing commercial flights in the United States from 1988-2008.

(flights <- read_csv("data/flights-departed.csv"))
## Parsed with column specification:
## cols(
##   date = col_date(format = ""),
##   value = col_integer()
## )
## # A tibble: 7,671 × 2
##          date value
##        <date> <int>
## 1  1988-01-01 12681
## 2  1988-01-02 13264
## 3  1988-01-03 13953
## 4  1988-01-04 13921
## 5  1988-01-05 13932
## 6  1988-01-06 13157
## 7  1988-01-07 11159
## 8  1988-01-08 11631
## 9  1988-01-09 12045
## 10 1988-01-10 13160
## # ... with 7,661 more rows

We will use ggplot2 to generate several graphs based on the data. The first is a simple line plot over time of the daily commercial flights. To build this, we don’t need to modify flights:

ggplot(flights, aes(date, value)) +
  geom_line() +
  labs(x = NULL,
       y = "Number of departing commercial flights")

But this is quite noisy. Instead, let’s draw a line plot depicting commercial flights over a one-year period, with separate lines for each year in the data (1988, 1989, 1990, etc.). To do that, we need to create a new variable year which will serve as our grouping variable in ggplot():

(flights <- flights %>%
  mutate(year = year(date),
         yday = yday(date),
         # hack to label the x-axis with months
         days = dmy(format(date,"%d-%m-2016"))))
## # A tibble: 7,671 × 5
##          date value  year  yday       days
##        <date> <int> <dbl> <dbl>     <date>
## 1  1988-01-01 12681  1988     1 2016-01-01
## 2  1988-01-02 13264  1988     2 2016-01-02
## 3  1988-01-03 13953  1988     3 2016-01-03
## 4  1988-01-04 13921  1988     4 2016-01-04
## 5  1988-01-05 13932  1988     5 2016-01-05
## 6  1988-01-06 13157  1988     6 2016-01-06
## 7  1988-01-07 11159  1988     7 2016-01-07
## 8  1988-01-08 11631  1988     8 2016-01-08
## 9  1988-01-09 12045  1988     9 2016-01-09
## 10 1988-01-10 13160  1988    10 2016-01-10
## # ... with 7,661 more rows
ggplot(flights, aes(days, value)) +
  geom_line(aes(group = year), alpha = .2) +
  geom_smooth(se = FALSE) +
  scale_x_date(labels = scales::date_format("%b")) +
  labs(x = NULL,
       y = "Number of departing commercial flights")
## `geom_smooth()` using method = 'gam'

Or we could summarize the distribution of daily departing commercial flights within each month over the 1988-2008 period:

(flights <- flights %>%
  mutate(month = month(date, label = TRUE)))
## # A tibble: 7,671 × 6
##          date value  year  yday       days month
##        <date> <int> <dbl> <dbl>     <date> <ord>
## 1  1988-01-01 12681  1988     1 2016-01-01   Jan
## 2  1988-01-02 13264  1988     2 2016-01-02   Jan
## 3  1988-01-03 13953  1988     3 2016-01-03   Jan
## 4  1988-01-04 13921  1988     4 2016-01-04   Jan
## 5  1988-01-05 13932  1988     5 2016-01-05   Jan
## 6  1988-01-06 13157  1988     6 2016-01-06   Jan
## 7  1988-01-07 11159  1988     7 2016-01-07   Jan
## 8  1988-01-08 11631  1988     8 2016-01-08   Jan
## 9  1988-01-09 12045  1988     9 2016-01-09   Jan
## 10 1988-01-10 13160  1988    10 2016-01-10   Jan
## # ... with 7,661 more rows
ggplot(flights, aes(month, value)) +
  geom_violin() +
  geom_boxplot(width = .1, outlier.shape = NA) +
  labs(x = NULL,
       y = "Number of departing commercial flights")

Hmmm, there seems to be an outlier in September. What’s up with that?

Finally, we can generate a heatmap depicting how the data changes over time by creating a calendar-like visualization.1 To do this, we need the following grammar for the graph:

  • Layer
    • Data - flights
    • Mapping
      • \(x\) - weekday (e.g. Sunday, Monday, Tuesday)
      • \(y\) - week in month (e.g. first week, second week, third week)
      • Fill - value (number of departing flights)
    • Statistical transformation (stat) - identity
    • Geometric object (geom) - geom_tile()
    • Position adjustment (position) - none
  • Scale
    • Fill - low and high-end colors (use shading to identify in-between values)
  • Coordinate system - Cartesian coordinate plane
  • Faceting - facet_grid() (year X month)

To generate this graph, we need to create several new variables in flights:

  • Year
  • Month
  • Weekday
  • Week-in-month

lubridate can directly generate three of those four variables; since we have already created year and month, we only need to add weekday:

(flights <- flights %>%
  mutate(weekday = wday(date, label = TRUE)))
## # A tibble: 7,671 × 7
##          date value  year  yday       days month weekday
##        <date> <int> <dbl> <dbl>     <date> <ord>   <ord>
## 1  1988-01-01 12681  1988     1 2016-01-01   Jan     Fri
## 2  1988-01-02 13264  1988     2 2016-01-02   Jan     Sat
## 3  1988-01-03 13953  1988     3 2016-01-03   Jan     Sun
## 4  1988-01-04 13921  1988     4 2016-01-04   Jan     Mon
## 5  1988-01-05 13932  1988     5 2016-01-05   Jan    Tues
## 6  1988-01-06 13157  1988     6 2016-01-06   Jan     Wed
## 7  1988-01-07 11159  1988     7 2016-01-07   Jan   Thurs
## 8  1988-01-08 11631  1988     8 2016-01-08   Jan     Fri
## 9  1988-01-09 12045  1988     9 2016-01-09   Jan     Sat
## 10 1988-01-10 13160  1988    10 2016-01-10   Jan     Sun
## # ... with 7,661 more rows

We use label = TRUE to generate factor labels for these values (Sun, Mon, Tues) instead of the numeric equivalents (1, 2, 3); the same argument works for month(), producing Jan, Feb, Mar rather than 1, 2, 3.
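To see the effect of label = TRUE directly, compare the labeled and numeric forms on a single date (the exact abbreviations, e.g. "Tue" vs "Tues", vary across lubridate versions):

```r
library(lubridate)

d <- ymd("2017-04-18")

month(d)                # 4
month(d, label = TRUE)  # Apr (an ordered factor)
wday(d)                 # 3 (weeks start on Sunday by default)
wday(d, label = TRUE)   # Tuesday, abbreviated
```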

To generate the final week-in-month variable, we need to combine a few lubridate functions to get exactly what we want:

(flights <- flights %>%
  # generate variables for the week of the year (1-53) and the day of the year (1-366)
  mutate(week = week(date),
         yday = yday(date)) %>%
  # normalize so the calendar draws correctly: shift yday so weeks are
  # counted from the Sunday of the week containing January 1st
  group_by(year) %>%
  mutate(yday = yday + wday(date)[1] - 2,
         week = floor(yday / 7)) %>%
  group_by(year, month) %>%
  mutate(week_month = week - min(week) + 1))
## Source: local data frame [7,671 x 9]
## Groups: year, month [252]
## 
##          date value  year  yday       days month weekday  week week_month
##        <date> <int> <dbl> <dbl>     <date> <ord>   <ord> <dbl>      <dbl>
## 1  1988-01-01 12681  1988     5 2016-01-01   Jan     Fri     0          1
## 2  1988-01-02 13264  1988     6 2016-01-02   Jan     Sat     0          1
## 3  1988-01-03 13953  1988     7 2016-01-03   Jan     Sun     1          2
## 4  1988-01-04 13921  1988     8 2016-01-04   Jan     Mon     1          2
## 5  1988-01-05 13932  1988     9 2016-01-05   Jan    Tues     1          2
## 6  1988-01-06 13157  1988    10 2016-01-06   Jan     Wed     1          2
## 7  1988-01-07 11159  1988    11 2016-01-07   Jan   Thurs     1          2
## 8  1988-01-08 11631  1988    12 2016-01-08   Jan     Fri     1          2
## 9  1988-01-09 12045  1988    13 2016-01-09   Jan     Sat     1          2
## 10 1988-01-10 13160  1988    14 2016-01-10   Jan     Sun     2          3
## # ... with 7,661 more rows
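As a sanity check on the normalization above, we can trace one date by hand. January 1, 1988 fell on a Friday (wday = 6), so every yday is shifted by 6 - 2 = 4 days, placing January 10 in week 2:

```r
library(lubridate)

d <- ymd("1988-01-10")
jan1 <- ymd("1988-01-01")

wday(jan1)                            # 6 (Friday; Sunday = 1)
adj_yday <- yday(d) + wday(jan1) - 2  # 10 + 6 - 2 = 14
floor(adj_yday / 7)                   # week 2, as in the output above
```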

Now that we have the data correctly formatted and all the components are extracted, we can draw the graph:

ggplot(flights, aes(weekday, week_month, fill = value)) +
  facet_grid(year ~ month) +
  geom_tile(color = "black") +
  scale_fill_continuous(low = "green", high = "red") +
  scale_x_discrete(labels = NULL) +
  scale_y_reverse(labels = NULL) +
  labs(title = "Domestic commercial flight activity",
       x = NULL,
       y = NULL,
       fill = "Number of departing flights") +
  theme_void() +
  theme(legend.position = "bottom")

Aha, now the outlier makes sense. In the days following the September 11th attacks, the United States grounded virtually all commercial air traffic.

Smoothing lines

When examining multivariate continuous data, scatterplots are a quick and easy way to assess relationships. However, if the data points become too densely clustered, the graph becomes difficult to interpret. Consider the diamonds dataset:

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Carat size",
       y = "Price")
p

What is the relationship between carat size and price? It appears positive, but there are also a lot of densely packed data points in the middle of the graph. Smoothing lines summarize the relationship between variables by approximating its functional form, capturing the important patterns. The functional form can take on many shapes. A very common one is the best-fit line, also known as ordinary least squares (OLS) or simple linear regression. We can estimate the model using lm(), or plot the line directly with geom_smooth(method = "lm"):

p +
  geom_smooth(method = "lm", se = FALSE)

The downside to a linear best-fit line is that it assumes the relationship between the variables is linear and additive. As a result, the summarized relationship between carat size and price is wildly incorrect for diamonds larger than 3 carats. Instead we could use a generalized additive model, which allows for flexible, non-linear relationships between the variables while still implementing a basic regression approach:2

p +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam'

Locally weighted scatterplot smoothing (local regression, LOWESS, or LOESS) fits a separate non-linear function at each target point \(x_0\) using only the nearby training observations. This method estimates a regression line based on localized subsets of the data, building up the global function \(f\) point-by-point.

As an example, a local linear regression fit to the ethanol dataset in the lattice package shows how the LOESS curve is built up point-by-point, one local fit at a time.

One important argument you can control with LOESS is the span: the fraction of the data used in each local fit, which determines how smooth the LOESS function will be. A larger span results in a smoother curve but may be less accurate; a smaller span is more local and wiggly but improves the fit to the training data.

LOESS lines are best used for datasets with fewer than 1,000 observations; beyond that, the time and memory required to compute the line grow rapidly, which is why geom_smooth() switches to the gam method for larger datasets.
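To experiment with the span, request LOESS explicitly and vary the span argument. A sketch on a random subsample of diamonds (the sample size and span values here are arbitrary choices):

```r
library(ggplot2)

set.seed(1234)
# subsample to keep LOESS fast
small <- diamonds[sample(nrow(diamonds), 500), ]

ggplot(small, aes(carat, price)) +
  geom_point(alpha = .3) +
  geom_smooth(method = "loess", span = .2, se = FALSE) +  # wiggly
  geom_smooth(method = "loess", span = 1, se = FALSE,
              color = "red")                              # smooth
```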

Coefficient of correlation (\(r\))

  • Produces a measure of association, known as Pearson’s \(r\), that gauges the direction and strength of a relationship between two continuous variables
  • Scales between \(-1\) and \(+1\)
    • \(-1\) – perfect negative association between the variables
    • \(+1\) – perfect positive association between the variables
    • \(0\) – no relationship between the variables
  • Unit-less measure - no matter what scale the variables fall on (e.g. turnout, education, income), the number will always fall between \(-1\) and \(+1\)
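In R, Pearson's \(r\) is computed with cor(). For the diamonds data from earlier, carat size and price are strongly positively associated (roughly .92):

```r
library(ggplot2)  # for the diamonds data

cor(diamonds$carat, diamonds$price)
```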
r_plot <- function(r, n = 100){
  xy <- ecodist::corgen(len = n, r = r) %>%
    bind_cols
  
  ggplot(xy, aes(x, y)) +
    geom_point() +
    ggtitle(str_c("Pearson's r = ", r))
}

r_values <- c(.8, 0, -.8)

for (r in r_values) {
  print(r_plot(r))
}

Types of graphs to cover/demo

  • Smoothing lines
    • Brief intro to the different stat_smooth() methods
  • Scatterplot matrices
    • Heatmaps of correlation coefficients
  • Parallel coordinate plots
  • Contour plots
  • 3D plots using plotly
    • Surface plots
    • Scatterplots
    • Line plots

Session Info

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.3 (2017-03-06)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2017-04-18
## Packages ------------------------------------------------------------------
##  package    * version  date       source         
##  assertthat   0.1      2013-12-06 CRAN (R 3.3.0) 
##  backports    1.0.5    2017-01-18 CRAN (R 3.3.2) 
##  broom      * 0.4.2    2017-02-13 CRAN (R 3.3.2) 
##  colorspace   1.3-2    2016-12-14 CRAN (R 3.3.2) 
##  DBI          0.6      2017-03-09 CRAN (R 3.3.3) 
##  devtools     1.12.0   2016-06-24 CRAN (R 3.3.0) 
##  digest       0.6.12   2017-01-27 CRAN (R 3.3.2) 
##  dplyr      * 0.5.0    2016-06-24 CRAN (R 3.3.0) 
##  evaluate     0.10     2016-10-11 CRAN (R 3.3.0) 
##  forcats      0.2.0    2017-01-23 CRAN (R 3.3.2) 
##  foreign      0.8-67   2016-09-13 CRAN (R 3.3.3) 
##  ggplot2    * 2.2.1    2016-12-30 CRAN (R 3.3.2) 
##  ggthemes   * 3.4.0    2017-02-19 CRAN (R 3.3.2) 
##  gtable       0.2.0    2016-02-26 CRAN (R 3.3.0) 
##  haven        1.0.0    2016-09-23 cran (@1.0.0)  
##  hms          0.3      2016-11-22 CRAN (R 3.3.2) 
##  htmltools    0.3.5    2016-03-21 CRAN (R 3.3.0) 
##  httr         1.2.1    2016-07-03 CRAN (R 3.3.0) 
##  jsonlite     1.3      2017-02-28 CRAN (R 3.3.2) 
##  knitr      * 1.15.1   2016-11-22 cran (@1.15.1) 
##  lattice      0.20-34  2016-09-06 CRAN (R 3.3.3) 
##  lazyeval     0.2.0    2016-06-12 CRAN (R 3.3.0) 
##  lubridate    1.6.0    2016-09-13 CRAN (R 3.3.0) 
##  magrittr     1.5      2014-11-22 CRAN (R 3.3.0) 
##  memoise      1.0.0    2016-01-29 CRAN (R 3.3.0) 
##  mnormt       1.5-5    2016-10-15 CRAN (R 3.3.0) 
##  modelr     * 0.1.0    2016-08-31 CRAN (R 3.3.0) 
##  munsell      0.4.3    2016-02-13 CRAN (R 3.3.0) 
##  nlme         3.1-131  2017-02-06 CRAN (R 3.3.3) 
##  plyr         1.8.4    2016-06-08 CRAN (R 3.3.0) 
##  psych        1.7.3.21 2017-03-22 CRAN (R 3.3.2) 
##  purrr      * 0.2.2    2016-06-18 CRAN (R 3.3.0) 
##  R6           2.2.0    2016-10-05 CRAN (R 3.3.0) 
##  Rcpp         0.12.10  2017-03-19 cran (@0.12.10)
##  readr      * 1.1.0    2017-03-22 cran (@1.1.0)  
##  readxl       0.1.1    2016-03-28 CRAN (R 3.3.0) 
##  reshape2     1.4.2    2016-10-22 CRAN (R 3.3.0) 
##  rmarkdown    1.3      2016-12-21 CRAN (R 3.3.2) 
##  rprojroot    1.2      2017-01-16 CRAN (R 3.3.2) 
##  rvest        0.3.2    2016-06-17 CRAN (R 3.3.0) 
##  scales       0.4.1    2016-11-09 CRAN (R 3.3.1) 
##  stringi      1.1.2    2016-10-01 CRAN (R 3.3.0) 
##  stringr    * 1.2.0    2017-02-18 CRAN (R 3.3.2) 
##  tibble     * 1.2      2016-08-26 cran (@1.2)    
##  tidyr      * 0.6.1    2017-01-10 CRAN (R 3.3.2) 
##  tidyverse  * 1.1.1    2017-01-27 CRAN (R 3.3.2) 
##  withr        1.0.2    2016-06-20 CRAN (R 3.3.0) 
##  xml2         1.1.1    2017-01-24 CRAN (R 3.3.2) 
##  yaml         2.1.14   2016-11-12 cran (@2.1.14)

  1. A la U.S. Commercial Flights, 1995-2008.

  2. geom_smooth() automatically implements the gam method for datasets with greater than 1000 observations.